CS7643 Deep Learning - Module 1 (Intro to Neural Networks)

Lesson 1: Linear Classifiers and Gradient Descent

Components of a Parametric Learning Algorithm

Multiclass linear classifier:

  • You can interpret this as 3 independent linear classifiers
  • Note that you can fold the bias term into the weight matrix by appending it as an extra column, and appending a "1" to the end of the input. This reduces the computation to a single matrix-vector multiplication. Why? Efficiency.
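
The bias trick above can be sketched in NumPy (shapes and names here are illustrative, not from the lecture):

```python
import numpy as np

c, d = 3, 4                      # 3 classes, 4 input dimensions (assumed)
rng = np.random.default_rng(0)
W = rng.standard_normal((c, d))  # weight matrix
b = rng.standard_normal(c)       # bias vector
x = rng.standard_normal(d)       # one input

scores_two_ops = W @ x + b       # separate multiply and add

# Fold b into W as an extra column; append a 1 to the input.
W_aug = np.hstack([W, b[:, None]])   # shape [c x (d+1)]
x_aug = np.append(x, 1.0)            # shape [(d+1) x 1]
scores_one_op = W_aug @ x_aug        # single matrix-vector product
```

Both versions produce identical scores; the augmented form just trades the separate bias-add for one larger multiply.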

Interpreting a Linear Classifier

How we can interpret:

  • Algebraic: We can view classification as a linear algebra computation. Given a matrix of pixel values, each pixel is tied to a weight plus a bias.
  • Visual: We can reshape each weight vector back into the shape of the image and visualize it. After optimization, the image looks like an average over all inputs of that class.
  • Geometric: We can view the linear classifier as a linear separator (hyperplane) in a high-dimensional space.

Limitations:

  • Not all classes are linearly separable.
  • e.g. the XOR and bullseye functions are not linearly separable

Performance Measure for a Classifier

Converting Scores to Probabilities

Use softmax function to convert scores to probabilities:

$$ s = f(x,W) \\ P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}} $$

Steps:

  1. Given score vector $s$ (score for each class)
  2. Feed this vector through softmax function (2nd equation, right)
  3. For each class $k$, exponentiate the score for class $k$ and divide it by the sum of exponentials over all the scores.
  4. This normalizes all of the scores to lie between zero and one and sum to one, which fits the definition of a probability distribution.
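
The steps above can be sketched as a small NumPy function (the max-shift is a standard numerical-stability trick, not something the lecture requires):

```python
import numpy as np

def softmax(scores):
    """Convert a score vector to probabilities."""
    shifted = scores - np.max(scores)  # shift by max to avoid overflow in exp
    exps = np.exp(shifted)             # step 3: exponentiate each score
    return exps / np.sum(exps)         # step 3/4: normalize to sum to one

p = softmax(np.array([2.0, 1.0, 0.1]))  # highest score -> highest probability
```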

Performance Measure

Example of multiclass SVM loss:

Takeaways:

  • Loss is zero if the score for $y_i$ (ground truth label) is greater than or equal to the scores of all the other (incorrect) classes, plus 1. (First equation)
    • Goal is to maintain a margin between the score for the ground truth label and all the other possible categories.
  • If not, penalize it by how far it falls short of this margin.
    • For each class $j$ that is not the ground truth, compute $\max(0, s_j - s_{y_i} + 1)$ (second equation)
    • Sum these terms over all classes that aren't the ground truth; the model is penalized whenever the ground-truth score does not exceed an incorrect score by the margin.
  • This type of loss is called a hinge loss.

Example of how this is calculated with image class prediction:

  • Car score is highest (wrong)
  • For each incorrect class, take the max of 0 and (incorrect score − ground truth score + 1)
  • In this case, the car score will incur loss but the frog score will not (since it falls below the cat score by more than the margin).
  • The final loss is the sum of these terms (in this case, 2.9)
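
The worked example can be sketched in NumPy; the scores below (cat 3.2, car 5.1, frog −1.7) are assumed values chosen to be consistent with the 2.9 total, not taken from the slides:

```python
import numpy as np

def multiclass_svm_loss(scores, y, margin=1.0):
    """Hinge loss: sum of max(0, s_j - s_y + margin) over incorrect classes j."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                 # the ground-truth class contributes nothing
    return margins.sum()

# Assumed example scores; cat (index 0) is the ground truth.
scores = np.array([3.2, 5.1, -1.7])  # [cat, car, frog]
loss = multiclass_svm_loss(scores, y=0)
# car contributes max(0, 5.1 - 3.2 + 1) = 2.9; frog contributes 0
```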

Loss for classification

Cross-entropy and MLE:

  • For multiclass loss, we use the negative log probability of the softmax output:

Example using the image classification task:

Takeaways:

  1. Given raw scores, we exponentiate them (orange box)
  2. Then normalize to get probabilities (green box)
  3. Take the negative log of the probability assigned to the one that matches ground truth (e.g. "cat")
  4. We don't take the other probabilities into account for the loss. Why?
    1. Probabilities inherently induce competition (they must sum to one)
    2. The optimization algorithm can therefore both boost the weights for cat and decrease the weights for the other classes.
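
Steps 1–3 can be sketched as a log-softmax cross-entropy in NumPy (the log-space formulation is a standard stability choice, not from the lecture):

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """Negative log probability of the ground-truth class y under softmax."""
    shifted = scores - np.max(scores)                     # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted))) # log of softmax
    return -log_probs[y]  # only the ground-truth entry enters the loss
```

Note that raising the ground-truth score (relative to the others) drives this loss toward zero, which is exactly the competition effect described above.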

Loss for regression / probability

Regularization term

Regularization encodes a preference for simpler models over complex ones. A regularization term is added to the loss function:

L1 regularization adds the L1 norm of the weight vector to the loss: $$ L_i = |y_i - Wx_i|^2 + \lambda|W|_1 $$
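
A minimal sketch of this regularized loss (the $\lambda$ value and shapes are illustrative):

```python
import numpy as np

def regularized_loss(W, x, y, lam=0.1):
    """Squared-error data term plus an L1 penalty on the weights."""
    data_term = np.sum((y - W @ x) ** 2)      # |y - Wx|^2
    reg_term = lam * np.sum(np.abs(W))        # lambda * |W|_1, favors sparse W
    return data_term + reg_term
```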

Linear Algebra View: Vector and Matrix Sizes

Size of weights and inputs

  • c = number of classes
  • d = dimensionality of input
  • $W$: Size = $[c \times (d+1)]$
  • $x$: Size = $[(d+1) \times 1]$

Dimensionality of derivatives - Conventions

Given a scalar $s \in \mathbb{R}^1$ and vector $v \in \mathbb{R}^m$:

  • Size of the partial derivative of $v$ wrt $s$: $\mathbb{R}^{m\times 1}$
    • ie. column vector of size $m$
  • Size of the partial derivative of $s$ wrt $v$: $\mathbb{R}^{1\times m}$
    • ie. row vector of size $m$

Q: Given 2 vectors $v^1 \in \mathbb{R}^n$ and $v^2 \in \mathbb{R}^m$, what's the size of $\frac{\partial v^1}{\partial v^2}$?

  • Answer: A Jacobian matrix of size $n \times m$, containing the partial derivative of each element of $v^1$ wrt each element of $v^2$.

Q: Given a scalar and a matrix, what's the size of $\frac{\partial s}{\partial M}$?

  • Answer: A matrix the same size as $M$, containing the derivative of the scalar wrt each element of the matrix

Q: What is the size of $\frac{\partial L}{\partial W}$?

  • Answer: Since the loss is a scalar and the weights form a matrix, the derivative is a matrix the same size as $W$, containing the derivative of the loss wrt each element of the weight matrix.

Jacobians of Batches

Takeaway:

  • Batch size affects the dimensionality of our data
  • Tensors: multidimensional matrices
  • Can be unwieldy - instead, flatten input to vector and get a vector of derivatives.
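
A small sketch of the flattening idea (batch and image sizes are assumed, not from the lecture):

```python
import numpy as np

# With a batch, inputs become tensors (multidimensional arrays).
batch, h, w = 32, 28, 28
images = np.zeros((batch, h, w))     # batched image tensor

# Flatten each input to a vector so derivative bookkeeping stays simple:
flat = images.reshape(batch, -1)     # each row is one flattened input

# The gradient of a scalar loss wrt one flattened input is then just a vector
# of the same length, instead of a higher-order tensor.
grad = np.zeros(flat.shape[1])
```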

How is Deep Learning Different?

What makes it different:

  1. Representation learning - takes in raw data rather than a processed form (e.g. a histogram)
    • Conducts feature extraction - automatically pulling features from raw data
  2. Uses neural networks
  3. Can tackle various ML tasks (unsupervised, RL, etc)

What is deep learning:

Features: Trad vs Deep Learning

Features are engineered in traditional ML:

Features are automatically extracted in deep learning:

  • Key is hierarchical compositionality, meaning all data has some hierarchical order which neural networks can represent.

Example of features for image detection:

Building a complicated function

Representation of data is done through building up a network of simple functions into a complex network.

  • This idea is similar to boosting where we employ an ensemble of weak learners.
  • Significance: We can use any differentiable function to construct the network (sin, quadratic, log, etc).
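
To illustrate the point, here is a toy composition of simple differentiable pieces (sin, a quadratic, log) into one more complicated function; the particular combination is made up:

```python
import numpy as np

def f(x, w1, w2):
    """Toy composite function: log(1 + (sin(w1*x) + w2*x)^2).

    Each sub-block (sin, square, log) is simple and differentiable,
    so the whole composition is differentiable via the chain rule.
    """
    inner = np.sin(w1 * x) + w2 * x   # simple differentiable pieces
    return np.log(1.0 + inner ** 2)   # composed into a complex function
```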

End-to-End Learning

"End-to-end": Learning is applied to entire spectrum, from raw data -> feature extraction -> classification.

  • No handcrafted feature extraction; features are extracted automatically. The distinction between extracted features and the classifier becomes blurry.
  • Key Idea: Learn a highly separable feature representation that can be easily classified.

Gradient Descent

Derivatives

Algorithm:

Applying batch gradient descent:

Convergence notes:

Computing Gradients

How to compute $\frac{\partial L}{\partial W_i}$?

  1. Manual differentiation
    • Labor intensive
    • Can't compute closed form solution for complex functions
  2. Symbolic differentiation
    • Similar to manual
  3. Numerical differentiation
    • Works for any function
    • Computationally expensive
  4. Automatic differentiation
    • Used by most DL libraries
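
Numerical differentiation (option 3) can be sketched with a central-difference check; the test function here is an assumed toy whose closed-form gradient is known:

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference gradient of a scalar function f at w.

    Works for any function, but needs two evaluations of f per
    parameter, which is why it is computationally expensive.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = eps
        grad.flat[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Sanity check against a known closed form: f(w) = sum(w^2), grad = 2w.
w = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(lambda v: np.sum(v ** 2), w)
```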

Manual Differentiation

Derivation of update rule using squared loss:

  • Note on 2nd to last summation on update rule:

    The partial derivative of this summation with respect to $w_j$ causes most of the terms to vanish: when $i$ is not equal to $j$, the term involving $w_i$ does not depend on $w_j$, so its derivative is zero.

(Some more context on getting the partial derivative of update rule above)

Update rule once we add non-linearity (sigmoid) - Gets more complex:
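
A minimal sketch of this update rule, assuming a squared loss $L = (y - \sigma(w \cdot x))^2$ with a sigmoid non-linearity (the learning rate and data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def manual_update(w, x, y, lr=0.1):
    """One gradient-descent step using the hand-derived gradient."""
    s = sigmoid(w @ x)
    # Chain rule: dL/dw = -2 (y - s) * s * (1 - s) * x
    grad = -2.0 * (y - s) * s * (1.0 - s) * x
    return w - lr * grad

w = np.array([0.0, 0.0])   # assumed starting weights
x = np.array([1.0, 2.0])   # assumed single input
w_new = manual_update(w, x, y=1.0)
```

Each update step moves the prediction $\sigma(w \cdot x)$ closer to the target, which is easy to verify by comparing the loss before and after.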

Decomposing a Function

Manual differentiation can get messy. We can decompose the complicated function into modular sub-blocks.

Distributed Representations

Key ideas:

  • No single neuron "encodes" everything
  • Groups of neurons work together as a distributed representation

Distributed representation: Toy example

  • One-hot labels of shapes (left)
  • By distributing characteristics (right), can efficiently cover a wide range of shapes by combining them.
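
A toy sketch of this contrast (the three binary characteristics are made up for illustration):

```python
import numpy as np

n_shapes = 8

# One-hot encoding: one unit per shape, 8 units for 8 shapes.
one_hot = np.eye(n_shapes)

# Distributed encoding: 3 binary characteristics (e.g. round?, filled?,
# large?) cover all 2^3 = 8 shapes with only 3 units.
distributed = np.array([[(i >> b) & 1 for b in range(3)]
                        for i in range(n_shapes)])
```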

Lesson 2: Neural Networks

Neural Network View of a Linear Classifier

Lesson 3: Optimization of Deep Neural Networks